
fixes bug where multiple nodes fight over an eip #435

Merged: 4 commits merged into main from fix-contention-issue-simple on Mar 27, 2024

Conversation

@jubrad (Contributor) commented Mar 15, 2024

- introduces resource_ids to eip status
- moves setting the attachment status to before the attempt at attaching
- moves the status update from patch to replace; this should raise an error if the resource has changed, allowing the resource_id to function somewhat as a lock (see the sketch below)
- a resource won't attempt to claim an eip that already has a resource_id
- fixes cargo deny errors
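
A minimal sketch of the replace-as-lock idea, using trimmed-down stand-ins for the operator's CRD types (the `Eip`/`EipStatus` shapes and the `claim_eip` helper are illustrative, not the PR's exact code):

```rust
use kube::{api::PostParams, Api, CustomResource};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Hypothetical, trimmed-down stand-ins for the operator's CRD types.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(group = "materialize.cloud", version = "v1", kind = "Eip", status = "EipStatus", namespaced)]
pub struct EipSpec {}

#[derive(Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
pub struct EipStatus {
    pub resource_id: Option<String>,
}

// Hypothetical claim helper illustrating the mechanism.
async fn claim_eip(eips: &Api<Eip>, eip_name: &str, resource_id: &str) -> anyhow::Result<Eip> {
    // Read the current EIP, which carries its resourceVersion.
    let mut eip = eips.get(eip_name).await?;
    // Record who is claiming it *before* attempting the actual attachment.
    eip.status.get_or_insert_with(EipStatus::default).resource_id = Some(resource_id.to_owned());
    // replace_status sends the whole object back, resourceVersion included.
    // If anything else modified the EIP since our get, the API server rejects
    // the write with 409 Conflict, so resource_id behaves somewhat like a lock.
    let body = serde_json::to_vec(&eip)?;
    Ok(eips.replace_status(eip_name, &PostParams::default(), body).await?)
}
```

With a plain patch a concurrent writer's changes would be merged silently; the replace makes the race visible so the loser can back off and requeue.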

Alternative to #348

I believe this still has issues:

  • ensuring new eips that get created are associated with the correct resources in a timely manner.
  • the dns target annotation is not removed for pods
  • there is no guarantee that an eip that has lost its resource_id will be assigned a new resource

The main issue with an approach like this:
when a node/pod drops an eip, we're left relying on the pod/node reconciliation cycle to reassign it. The EIP will reconcile due to the status change, but without adding a considerable amount of code as was done in #348, EIPs won't attach themselves to resources or cause resources to re-reconcile and set up the attachment to the eips.
^ Update to this
I've now pushed a change where the eip will use a dynamic resource provider to find all resources matching its selector and update a label with a timestamp, causing those resources to be reconciled. This seems to work really well. Let me know if there are concerns.
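
Roughly what that looks like, as a sketch rather than the PR's exact code: the real change goes through a dynamic resource provider so it covers both nodes and pods, and the label key here is made up.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

use k8s_openapi::api::core::v1::Node;
use kube::{
    api::{ListParams, Patch, PatchParams},
    Api, Client,
};
use serde_json::json;

// Touch every resource matching the EIP's selector so its controller
// re-reconciles and sets up the attachment. Shown for nodes only; the
// label key is hypothetical.
async fn trigger_reconcile(client: Client, selector: &str) -> anyhow::Result<()> {
    let nodes: Api<Node> = Api::all(client);
    let stamp = SystemTime::now().duration_since(UNIX_EPOCH)?.as_secs().to_string();
    for node in nodes.list(&ListParams::default().labels(selector)).await?.items {
        let name = node.metadata.name.clone().unwrap_or_default();
        // Updating a label changes the object, which the node/pod controller's
        // watch picks up, forcing a reconcile of that resource.
        let patch = json!({ "metadata": { "labels": { "eip.materialize.cloud/refresh": stamp.as_str() } } });
        nodes.patch(&name, &PatchParams::default(), &Patch::Merge(&patch)).await?;
    }
    Ok(())
}
```

The timestamp value only needs to change so the resource's watch fires; nothing reads it.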

@jubrad requested a review from a team as a code owner on March 15, 2024 05:41
@jubrad force-pushed the fix-contention-issue-simple branch from 10ee2a8 to e2c9b82 on March 15, 2024 05:53
@jubrad force-pushed the fix-contention-issue-simple branch from e2c9b82 to 5699eed on March 15, 2024 17:02
@jubrad force-pushed the fix-contention-issue-simple branch from 5699eed to 484a96e on March 15, 2024 17:38
Err(err)
    if err
        .to_string()
        .contains("Operation cannot be fulfilled on eips.materialize.cloud") =>
Collaborator:

What error type is this? If it is from Kubernetes or something we defined, we should be able to match a more specific type than just doing string matching.

Contributor Author (@jubrad):

I think this is an eip_operator::Error that comes from a kube::Error. I want to say I attempted to match an error, but this was just easier and iirc I might've needed a guard regardless.
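
For reference, matching on the typed error instead of the message would look roughly like this; `Error::Kube` is a made-up name for however eip_operator::Error wraps the underlying kube::Error, and a guard is indeed still needed:

```rust
// Hypothetical sketch: `Error::Kube` stands in for the operator's wrapper
// around kube::Error; `result` is the claim attempt from above.
match result {
    // 409 Conflict from the API server: another resource replaced the EIP
    // status first. Same meaning as the string match, but typed.
    Err(Error::Kube(kube::Error::Api(resp))) if resp.code == 409 => {
        // expected race: log at info level and requeue
    }
    Err(other) => return Err(other),
    Ok(_) => {}
}
```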

"Pod {} failed to claim eip {}, rescheduling to try another",
name, eip_name
);
return Ok(Some(Action::requeue(Duration::from_secs(1))));
Contributor:

Why do we have this special handling at all? If the reconcile loop returns an error, it will already log an error and schedule a re-reconcile in the very near future - I don't quite understand what we gain from handling this manually.

Contributor Author (@jubrad):

If we're expecting this as a valid possibility due to a conflict, and we're able to handle it appropriately, is there really an error? I can just return the error, but I think that makes the logs more confusing than having an info message indicating this sort of thing was expected.

@jubrad force-pushed the fix-contention-issue-simple branch from bd85c36 to 4f7efc7 on March 23, 2024 06:02
@jubrad force-pushed the fix-contention-issue-simple branch 3 times, most recently from 1edfaac to 1c3cb13 on March 26, 2024 03:50
@jubrad force-pushed the fix-contention-issue-simple branch from 1c3cb13 to b3bd1b0 on March 26, 2024 05:55
@alex-hunt-materialize (Collaborator) left a comment:

This looks great! I have one remaining question and a few nitpicks, but all the logic looks correct to me.

s.resource_id.is_none()
    || s.resource_id.as_ref().map(|r| r == &name).unwrap_or(false)
})
});
Collaborator:

How do we handle the migration from older versions of the EIP operator, which don't have resource_ids? Is this OK just because we don't currently have multiple nodes that match?

Contributor Author (@jubrad):

yeah, there could be some re-alignment during this migration if there are multiple nodes matching a single eip, but it shouldn't be any more unstable than the existing implementation.

@alex-hunt-materialize (Collaborator) left a comment:

I left a bunch of nitpicks in my previous review comments, but won't block the PR on those.

@jubrad merged commit 164383c into main on Mar 27, 2024
3 checks passed
@jubrad deleted the fix-contention-issue-simple branch on March 27, 2024 15:05